Machine Learning Final Project

Authors
Affiliation

Adam Lodrik

University of Lausanne

Favre Stefan

Macaraeg Jeff

Published

May 16, 2024

Abstract

The following machine learning project focuses on…

1 Introduction

• The context and background: course, company name, business context.

During the first year of our Master in Management, orientation Business Analytics, we had the opportunity to attend the Machine Learning for Business Analytics lectures. In this class, we saw multiple machine learning techniques for business contexts, mainly covering supervised methods (regressions, trees, support vector machines, neural networks) and unsupervised methods (clustering, PCA, FAMD, auto-encoders), but also other topics such as data splitting, ensemble methods and evaluation metrics.

• Aim of the investigation: major terms should be defined, the question of research (more generally the issue), why it is of interest and relevant in that context.

In the context of this class, our group had the opportunity to work on an applied project. Starting from scratch, we had to look for a dataset on which to apply, on real cases, what we had learned in class. We found an interesting dataset concerning vehicle MPG, range, engine statistics and more, for more than 100 brands. The goal of our research was to predict the make (i.e. the brand) of a car from its characteristics (consumption, range, fuel type, …) using a model that we would train (random forest, artificial neural network or classification trees). As some cars share several identical characteristics but differ on various other ones, we thought it would be pertinent to build a model able to predict a car's brand from its features.

• Description of the data and the general material provided and how it was made available (and/or collected, if it is relevant). Only in broad terms however, the data will be further described in a following section. Typically, the origin/source of the data (the company, webpage, etc.), the type of files (Excel files, etc.), and what it contains in broad terms (e.g. “a file containing weekly sales with the factors of interest including in particular the promotion characteristics”).

The csv dataset was found on data.world, a data catalog platform that gathers various open-access datasets online. The file contains more than 45,000 rows and 26 columns, each column corresponding to one feature (such as the model year, the model, the consumption in petroleum barrels, the highway MPG per fuel type and so on).

• The method that is used, in broad terms, no details needed at this point. E.g. “Model based machine learning will help us quantifying the important factors on the sales”.

Among these columns, we had to find a machine learning model that could help us quantify the importance of the features in predicting the make of the car. Various models will be tried, covering both supervised and unsupervised learning.

• An outlook: a short paragraph indicating from now what will be treated in each following sections/chapters. E.g. “in Section 3, we describe the data. Section 4 is dedicated to the presentation of the text mining methods…”

From now on, we will go through the different sections. Section 2 is dedicated to the data description in more depth, mentioning the variables and features, the instances, the types of data and some missing-data patterns, followed by a first exploratory look at the raw data. Section 3 covers the data cleaning. Section 4 presents the Exploratory Data Analysis (EDA) on the cleaned data, with visualisations that help us perceive patterns in the variables as well as potential correlations. The following sections are dedicated to the methods, a classification tree and a neural network, in order to find a suitable model for our project. The results are discussed along the way, and we then proceed with a conclusion, recommendations and a discussion. Finally, the references and appendix can be found at the end of the report.

2 Data description

  • Description of the data file format (xlsx, csv, text, video, etc.)
  • The features or variables: type, units, the range (e.g. the time, numerical, in weeks from January 1, 2012 to December 31, 2015), their coding (numerical, the levels for categorical, etc.), etc.
  • The instances: customers, company, products, subjects, etc.
  • Missing data pattern: if there are missing data, if they are specific to some features, etc.
  • Any modification to the initial data: aggregation, imputation in replacement of missing data, recoding of levels, etc.
  • If only a subset was used, it should be mentioned and explained; e.g. inclusion criteria. Note that if inclusion criteria do not exist and the inclusion was an arbitrary choice, it should be stated as such. One should not try to invent unreal justifications.

For this project, we selected a dataset focused on vehicle characteristics, available as a .csv file from data.world. You can access the dataset via the following link: data.world. It includes a total of 26 features describing 45,896 vehicle models released between 1984 and 2023. Below is a table providing an overview of the available features and their descriptions. You can find a deeper description of the data in ?@sec-Annex.

2.0.1 The features or variables: type, units,…

Variable Name Explanation
ID Number corresponding to the precise combination of the features of the model
Model Year Year of the model of the car
Make Brand of the car
Model The model of the car
Estimated Annual Petroleum Consumption (Barrels) Consumption in Petroleum Barrels
Fuel Type 1 First fuel energy source, the only source if not a hybrid car
City MPG (Fuel Type 1) Consumption of the car in miles per gallon of fuel when driving in a city, for fuel type 1
Highway MPG (Fuel Type 1) Consumption of the car in miles per gallon of fuel when driving on a highway, for fuel type 1
Combined MPG (Fuel Type 1) Combined city and highway car consumption in miles per gallon, for fuel type 1
Fuel Type 2 Second energy source if hybrid car
City MPG (Fuel Type 2) Consumption of the car in miles per gallon of fuel when driving in a city, for fuel type 2
Highway MPG (Fuel Type 2) Consumption of the car in miles per gallon of fuel when driving on a highway, for fuel type 2
Combined MPG (Fuel Type 2) Combined city and highway car consumption in miles per gallon, for fuel type 2
Engine Cylinders Number of cylinders of the car
Engine Displacement Measure of the cylinder volume swept by all of the pistons of a piston engine, excluding the combustion chambers
Drive Drivetrain of the car, e.g. Front-Wheel Drive, Rear-Wheel Drive
Engine Description Description of the car, e.g. Turbo, Stop-Start, ...
Transmission Manual/Automatic transmission, with number of gears and/or model of transmission
Vehicle Class e.g. Minivan, Trucks, Midsize,....
Time to Charge EV (hours at 120v) Charging time of an electric vehicle on a 120-volt outlet, in hours
Time to Charge EV (hours at 240v) Charging time of an electric vehicle on a 240-volt outlet, in hours
Range (for EV) Total range of an electric vehicle
City Range (for EV - Fuel Type 1) City range of an electric vehicle, for fuel type 1
City Range (for EV - Fuel Type 2) City range of an electric vehicle, for fuel type 2
Hwy Range (for EV - Fuel Type 1) Highway range of an electric vehicle, for fuel type 1
Hwy Range (for EV - Fuel Type 2) Highway range of an electric vehicle, for fuel type 2

2.1 The instances: customers, company, products, subjects, etc.

Each row concerns one car model configuration. In order, we find the ID of the car, corresponding to a precise combination of feature values, followed by the features presented in the table above.

2.2 Missing data pattern: if there are missing data, if they are specific to some features, etc.

2.3 Any modification to the initial data: aggregation, imputation in replacement of missing data, recoding of levels, etc.

2.4 If only a subset was used, it should be mentioned and explained; e.g. inclusion criteria. Note that if inclusion criteria do not exist and the inclusion was an arbitrary choice, it should be stated as such. One should not try to invent unreal justifications.

EDA (before cleaning)

Columns description

To begin with our EDA, let’s have a look at our dataset and in particular the characteristics of the columns.

Show the code
#to get a detailed summary
skim(data)
Data summary
Name data
Number of rows 45896
Number of columns 26
_______________________
Column type frequency:
character 8
numeric 18
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
make 0 1.00 3 34 0 141 0
model 0 1.00 1 47 0 4762 0
fuel_type_1 0 1.00 6 17 0 6 0
fuel_type_2 44059 0.04 3 11 0 4 0
drive 1186 0.97 13 26 0 7 0
engine_Description 17031 0.63 1 46 0 589 0
transmission 11 1.00 12 32 0 40 0
vehicle_class 0 1.00 4 34 0 34 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
ID 0 1.00 23102.11 13403.10 1.00 11474.75 23090.50 34751.25 46332.00 ▇▇▇▇▇
model_year 0 1.00 2003.61 12.19 1984.00 1992.00 2005.00 2015.00 2023.00 ▇▆▆▇▇
estimated_Annual_Petrolum_Consumption_Barrels 0 1.00 15.33 4.34 0.05 12.94 14.88 17.50 42.50 ▁▇▃▁▁
City_MPG_Fuel_Type_1 0 1.00 19.11 10.31 6.00 15.00 17.00 21.00 150.00 ▇▁▁▁▁
highway_mpg_fuel_type_1 0 1.00 25.16 9.40 9.00 20.00 24.00 28.00 140.00 ▇▁▁▁▁
combined_MPG_Fuel_Type_1 0 1.00 21.33 9.78 7.00 17.00 20.00 23.00 142.00 ▇▁▁▁▁
City_MPG_Fuel_Type_2 0 1.00 0.85 6.47 0.00 0.00 0.00 0.00 145.00 ▇▁▁▁▁
highway_mpg_fuel_type_2 0 1.00 1.00 6.55 0.00 0.00 0.00 0.00 121.00 ▇▁▁▁▁
combined_MPG_Fuel_Type_2 0 1.00 0.90 6.43 0.00 0.00 0.00 0.00 133.00 ▇▁▁▁▁
engine_cylinders 487 0.99 5.71 1.77 2.00 4.00 6.00 6.00 16.00 ▇▇▅▁▁
engine_displacement 485 0.99 3.28 1.36 0.00 2.20 3.00 4.20 8.40 ▁▇▅▂▁
time_to_Charge_EV_hours_at_120v_ 0 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 ▁▁▇▁▁
charge_time_240v 0 1.00 0.11 1.01 0.00 0.00 0.00 0.00 15.30 ▇▁▁▁▁
range_for_EV 0 1.00 2.36 24.97 0.00 0.00 0.00 0.00 520.00 ▇▁▁▁▁
range_ev_city_fuel_type_1 0 1.00 1.62 20.89 0.00 0.00 0.00 0.00 520.80 ▇▁▁▁▁
range_ev_city_fuel_type_2 0 1.00 0.17 2.73 0.00 0.00 0.00 0.00 135.28 ▇▁▁▁▁
range_ev_highway_fuel_type_1 0 1.00 1.51 19.70 0.00 0.00 0.00 0.00 520.50 ▇▁▁▁▁
range_ev_highway_fuel_type_2 0 1.00 0.16 2.46 0.00 0.00 0.00 0.00 114.76 ▇▁▁▁▁

The dataset that we are working with contains approximately 46,000 rows and 26 columns. We can see that most of our features concern the consumption of the cars. In addition, we notice that some variables contain many missing values and that the variable “Time.to.Charge.EV..hours.at.120v.” contains only 0s. We will handle these in the data cleaning section.

Exploration of the distribution

Here are more details about the distribution of the numerical features.

Show the code
#  melt.data <- melt(data)
# 
#  ggplot(data = melt.data, aes(x = value)) + 
# stat_density() + 
# facet_wrap(~variable, scales = "free")
plot_histogram(data)# Time.to.Charge.EV..hours.at.120v. not appearing because all observations = 0 

As most of the models in our dataset are neither EVs nor hybrid cars, many columns contain a large number of 0s. This issue will be tackled in the data cleaning section. Also, some features are discrete numerical variables, as we can see on the plot of the column “Engine Cylinders”.

Outliers Detection

For each of our numerical columns, let's use boxplots to check for outliers per feature.

Show the code
#tentative boxplots

data_long <- data %>%
  select_if(is.numeric) %>%
  pivot_longer(cols = c("ID", 
                        "model_year", 
                        "estimated_Annual_Petrolum_Consumption_Barrels", "City_MPG_Fuel_Type_1",
                        "highway_mpg_fuel_type_1", 
                        "combined_MPG_Fuel_Type_1", 
                        "City_MPG_Fuel_Type_2", 
                        "highway_mpg_fuel_type_2", 
                        "combined_MPG_Fuel_Type_2", 
                        "time_to_Charge_EV_hours_at_120v_", 
                        "charge_time_240v", 
                        "range_for_EV", 
                        "range_ev_city_fuel_type_1", 
                        "range_ev_city_fuel_type_2", 
                        "range_ev_highway_fuel_type_1", 
                        "range_ev_highway_fuel_type_2"), names_to = "variable", values_to = "value")

ggplot(data_long, aes(x = variable, y = value, fill = variable)) +
  geom_boxplot(outlier.size = 0.5) +  # Make outlier points smaller
  facet_wrap(~ variable, scales = "free_y") +  # Each variable gets its own y-axis
  theme_minimal() +
  theme(legend.position = "none",  # Hide the legend
        axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1, size = 0),strip.text = element_text(size = 7)) +  # Rotate x-axis labels
  labs(title = "Boxplots of Variables with Different Scales", x = "", y = "Value")

Most of our boxplots show extreme values. Again, this is due to the small number of EVs and hybrid cars in our dataset compared to the other models, and to the nature of some features, which only concern those types of vehicles.

Show the code
#Now 
# plot_correlation(data) #drop time charge EV 120V
# create_report(data)
#nb cars per brand

Number of models per make

Now let's check how many models per make we have in our dataset. In order to have a clear plot, we decided to keep only the top 20 brands on the graph. All the remaining makes are accessible in the tables just below.

Show the code
#Number of occurences/model per make 
nb_model_per_make <- data %>%
  group_by(make, model) %>%
  summarise(Number = n(), .groups = 'drop') %>%
  group_by(make) %>%
  summarise(Models_Per_Make = n(), .groups = 'drop') %>%
  arrange(desc(Models_Per_Make))

#makes with only 1 model 
only_one_model_cars <- nb_model_per_make %>% 
  filter(Models_Per_Make == 1 )

#table globale
datatable(nb_model_per_make,
          rownames = FALSE,
          options = list(pageLength = 10,
                         class = "hover",
                         searchHighlight = TRUE))
Show the code
#table only 1 model
datatable(only_one_model_cars,
          rownames = FALSE,
          options = list(pageLength = 10,
                         class = "hover",
                         searchHighlight = TRUE))
Show the code
# Option to limit to top 20 makes for better readability
top_n_makes <- nb_model_per_make %>% top_n(20, Models_Per_Make)

# Reordering the Make variable within the plotting code to make it ordered by Models_Per_Make descending
# nb_model_per_make$Make <- factor(nb_model_per_make$Make, levels = nb_model_per_make$Make[order(-nb_model_per_make$Models_Per_Make)])
Show the code
ggplot(top_n_makes, aes(x = reorder(make, Models_Per_Make), y = Models_Per_Make)) +
  geom_bar(stat = "identity", color = "black", fill = "grey", show.legend = FALSE) +
  labs(title = "Models per Make (Top 20)",
       x = "Make",
       y = "Number of Models") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 10),
        axis.text.y = element_text(hjust = 1, size = 10),
        plot.title = element_text(size = 14)) +
  coord_flip()  # Flip coordinates for better readability

Show the code
ggplot(only_one_model_cars, aes(x = make, y = Models_Per_Make)) +
  geom_bar(stat = "identity", color = "black", fill = "grey", show.legend = FALSE) +
  labs(title = "Makes with only 1 model",
       x = "Make",
       y = "Number of Models") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 10),
        axis.text.y = element_text(hjust = 1, size = 10),
        plot.title = element_text(size = 14)) +
  coord_flip()  # Flip coordinates for better readability

We can see that Mercedes-Benz and BMW have significantly more models in our dataset, which means that we are dealing with imbalanced categories. We therefore need to be careful when making predictions, as we may encounter a bias toward these two majority classes. Several techniques can be used to deal with this problem, such as resampling, ensemble methods (random forests, boosting) or tuning the probability threshold.
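As an illustration of the resampling option, here is a minimal pandas sketch (with hypothetical column names and toy data, not our actual pipeline) of random oversampling, where every make is resampled with replacement up to the size of the largest class:

```python
import pandas as pd

def oversample_to_max(df, target, seed=42):
    """Randomly resample each class (with replacement) up to the majority-class size."""
    max_n = df[target].value_counts().max()
    parts = [grp.sample(n=max_n, replace=True, random_state=seed)
             for _, grp in df.groupby(target)]
    return pd.concat(parts, ignore_index=True)

# Toy frame with an imbalanced 'make' column
toy = pd.DataFrame({"make": ["BMW"] * 8 + ["Fisker"] * 2, "mpg": range(10)})
balanced = oversample_to_max(toy, "make")
print(balanced["make"].value_counts().to_dict())  # both makes now appear 8 times
```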


Correlation matrix for numerical features

Show the code
#Here I encounter some problems with some of the variables. 

#check NAs
# colSums(is.na(data))

data_corrplot <- data %>%
  select_if(is.numeric)

# Identify constant columns (columns with zero standard deviation)
constant_columns <- sapply(data_corrplot, function(x) sd(x, na.rm = TRUE) == 0)

# Print constant columns for inspection
# print("Constant columns (standard deviation is zero):")
# print(names(data_corrplot)[constant_columns])

# Remove constant columns
data_corrplot <- data_corrplot[, !constant_columns]

# Correlation transformation for plot using complete observations
cor_matrix <- cor(data_corrplot, use = "complete.obs")

# Melt correlation matrix for plotting
cor_melted <- melt(cor_matrix)

# Plot correlation matrix heatmap using ggplot2
ggplot(data = cor_melted, aes(Var1, Var2, fill = value)) +
  geom_tile(color = "white") +
  scale_fill_gradient2(low = "blue", high = "red", mid = "white",
                       midpoint = 0, limit = c(-1, 1), space = "Lab",
                       name = "Pearson\nCorrelation") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, size = 8, hjust = 1),
        axis.text.y = element_text(size = 8)) +
  coord_fixed() +
  labs(x = '', y = '', title = 'Correlation Matrix Heatmap')

3 Data cleaning

In this section we handle the missing values of our dataset, to make sure that we have a clean dataset on which to perform our EDA and modeling. We first visualize the missing values in the dataset, then clean the missing values in the columns that we will use for our analysis. We also remove some rows and columns that are not relevant for our analysis.

Let’s have a look at the entire dataset and its missing values in grey.

We can see that, overall, we do not have many missing values relative to the size of our dataset. However, some columns have a lot of them. Below is the detail of the percentage of missing values by column.

3.1 Dealing with the columns Engine Cylinders and Engine Displacement

As we can see, we have missing values in 6 columns. Let's first have a closer look at the engine cylinders and engine displacement columns. They both have 484 missing values. After some data manipulation, we see that these 484 missing values all correspond to electric vehicles; since our dataset contains 484 electric vehicles, we know that the missing values in these columns only concern electric vehicles. This makes sense, since electric vehicles do not have a combustion engine, so these categories are not applicable to them. We will therefore replace all missing values in these two columns with “none”.
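In pandas, this replacement step could look roughly like the following sketch (the column names and values here are illustrative, not necessarily those of the actual file):

```python
import pandas as pd

# Toy frame mirroring the step above (hypothetical names)
cars = pd.DataFrame({
    "fuel_type_1": ["Regular", "Electricity", "Premium"],
    "engine_cylinders": [4.0, None, 6.0],
    "engine_displacement": [2.0, None, 3.0],
})

# Electric vehicles have no combustion engine, so their missing
# cylinder and displacement values get the explicit label "none"
ev = cars["fuel_type_1"] == "Electricity"
for col in ["engine_cylinders", "engine_displacement"]:
    cars[col] = cars[col].astype("object")
    cars.loc[ev & cars[col].isna(), col] = "none"

print(cars.isna().sum().sum())  # 0 missing values remain
```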

As we can see, we still have some missing values in the columns “Fuel Type 2”, “Engine Description”, “Drive” and “Transmission”. Let's investigate the missing values in the column “Drive”.

3.2 Dealing with the column Drive, Transmission and Engine Description

We decided to drop the brands with more than 10% of missing values in the “Drive” column. After this operation, we also removed the 8 observations that still had missing values in the “Transmission” column. Finally, we decided to drop the column “Engine Description”, since it contains missing values for more than a third of our observations.
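The first filtering step could be sketched in pandas as follows, assuming a drive column and the 10% threshold described above (toy data, hypothetical names):

```python
import pandas as pd

# Toy frame: make "A" has 2 of 3 'drive' values missing, make "B" has none
cars = pd.DataFrame({
    "make":  ["A", "A", "A", "B", "B"],
    "drive": ["FWD", None, None, "RWD", "AWD"],
})

# Share of missing 'drive' values per make; keep makes at or below 10%
miss_rate = cars["drive"].isna().groupby(cars["make"]).mean()
kept = miss_rate[miss_rate <= 0.10].index
cars = cars[cars["make"].isin(kept)].reset_index(drop=True)
print(list(cars["make"].unique()))  # only make "B" survives
```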

3.3 Final dataset

The dataset is now cleaned and does not contain any missing values. It contains 42,240 observations, 18 features and 129 brands. We renamed the columns and stored the result in a csv file (data_cleaned.csv). However, for some models, we need to tackle the unbalanced classes in the target variable. For this reason we also created a new csv file in which we dropped the makes with fewer than 10 models (data_cleaned_reduced.csv). This dataset contains 42,061 observations, 18 features and 66 brands.
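The reduced file could be derived with a sketch like this one (synthetic data and hypothetical column names; the real pipeline may differ in detail):

```python
import pandas as pd

# Keep only makes with at least 10 distinct models (threshold from the text)
cars = pd.DataFrame({
    "make":  ["A"] * 12 + ["B"] * 3,
    "model": [f"m{i}" for i in range(12)] + ["x", "y", "z"],
})
models_per_make = cars.groupby("make")["model"].nunique()
kept = models_per_make[models_per_make >= 10].index
cars_reduced = cars[cars["make"].isin(kept)]
print(sorted(cars_reduced["make"].unique()))  # ['A']
```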

Here are the cleaned datasets on which we will be working from now on.

4 EDA:

4.1 Columns description

To begin with our EDA, let’s have a look at our dataset and in particular the characteristics of the columns.

Show the code
#to get a detailed summary
skim(data)
Data summary
Name data
Number of rows 42240
Number of columns 18
_______________________
Column type frequency:
character 8
numeric 10
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
make 0 1 3 34 0 129 0
vehicle_class 0 1 4 34 0 34 0
drive 0 1 13 26 0 7 0
engine_cylinders 3 1 1 4 0 10 0
engine_displacement 2 1 1 4 0 67 0
transmission 0 1 12 32 0 39 0
fuel_type_1 0 1 6 17 0 6 0
fuel_type_2 0 1 3 11 0 5 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
model_year 0 1 2004.62 11.86 1984 1994 2006 2015 2023.00 ▆▆▆▇▇
city_mpg_fuel_type_1 0 1 19.11 10.63 6 15 17 21 150.00 ▇▁▁▁▁
highway_mpg_fuel_type_1 0 1 25.06 9.66 9 20 24 28 140.00 ▇▁▁▁▁
city_mpg_fuel_type_2 0 1 0.90 6.72 0 0 0 0 145.00 ▇▁▁▁▁
highway_mpg_fuel_type_2 0 1 1.06 6.79 0 0 0 0 121.00 ▇▁▁▁▁
range_ev_city_fuel_type_1 0 1 1.76 21.77 0 0 0 0 520.80 ▇▁▁▁▁
range_ev_highway_fuel_type_1 0 1 1.64 20.52 0 0 0 0 520.50 ▇▁▁▁▁
range_ev_city_fuel_type_2 0 1 0.19 2.85 0 0 0 0 135.28 ▇▁▁▁▁
range_ev_highway_fuel_type_2 0 1 0.18 2.57 0 0 0 0 114.76 ▇▁▁▁▁
charge_time_240v 0 1 0.12 1.06 0 0 0 0 15.30 ▇▁▁▁▁

The cleaned dataset that we are now working with contains 42,240 rows and 18 columns. Most of our features concern the consumption of the cars. The variables with many missing values, as well as the constant variable “Time.to.Charge.EV..hours.at.120v.”, were handled in the data cleaning section.

4.2 Exploration of the distribution

Here are more details about the distribution of the numerical features.

Show the code
#  melt.data <- melt(data)
# 
#  ggplot(data = melt.data, aes(x = value)) + 
# stat_density() + 
# facet_wrap(~variable, scales = "free")
plot_histogram(data)# Time.to.Charge.EV..hours.at.120v. not appearing because all observations = 0 

We notice that most of the observations of our features are 0s because of the nature of those features. For instance, as most of our cars are not hybrids, they have a single fuel type and no type 2, which results in 0s in the concerned features. Also, some features are discrete numerical variables, as we can see on the plot of the column “Engine Cylinders”.

4.3 Outliers Detection

For each of our numerical columns, let's use boxplots to check for outliers per feature.

Show the code
#tentative boxplots

data_long <- data %>%
  select_if(is.numeric) %>%
  pivot_longer(cols = c("model_year", "city_mpg_fuel_type_1", "highway_mpg_fuel_type_1", "city_mpg_fuel_type_2", "highway_mpg_fuel_type_2", "range_ev_city_fuel_type_1", "range_ev_highway_fuel_type_1", "range_ev_city_fuel_type_2", "range_ev_highway_fuel_type_2", "charge_time_240v"), names_to = "variable", values_to = "value")



ggplot(data_long, aes(x = variable, y = value, fill = variable)) +
  geom_boxplot(outlier.size = 0.5) +  # Make outlier points smaller
  facet_wrap(~ variable, scales = "free_y") +  # Each variable gets its own y-axis
  theme_minimal() +
  theme(legend.position = "none",  # Hide the legend
        axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1, size = 0),strip.text = element_text(size = 7)) +  # Rotate x-axis labels
  labs(title = "Boxplots of Variables with Different Scales", x = "", y = "Value")

Show the code
#Now 
# plot_correlation(data) #drop time charge EV 120V
# create_report(data)
#nb cars per brand

4.4 Number of models per make

As we got rid of XXX, …..

4.5 Correlation matrix for numerical features

Show the code
#Here I encounter some problems with some of the variables. 

#check NAs
# colSums(is.na(data))

data_corrplot <- data %>%
  select_if(is.numeric)

# Identify constant columns (columns with zero standard deviation)
constant_columns <- sapply(data_corrplot, function(x) sd(x, na.rm = TRUE) == 0)

# Print constant columns for inspection
print("Constant columns (standard deviation is zero):")
[1] "Constant columns (standard deviation is zero):"
Show the code
print(names(data_corrplot)[constant_columns])
character(0)
Show the code
# Remove constant columns
data_corrplot <- data_corrplot[, !constant_columns]

# Correlation transformation for plot using complete observations
cor_matrix <- cor(data_corrplot, use = "complete.obs")

# Melt correlation matrix for plotting
cor_melted <- melt(cor_matrix)

# Plot correlation matrix heatmap using ggplot2
ggplot(data = cor_melted, aes(Var1, Var2, fill = value)) +
  geom_tile(color = "white") +
  scale_fill_gradient2(low = "blue", high = "red", mid = "white",
                       midpoint = 0, limit = c(-1, 1), space = "Lab",
                       name = "Pearson\nCorrelation") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, size = 8, hjust = 1),
        axis.text.y = element_text(size = 8)) +
  coord_fixed() +
  labs(x = '', y = '', title = 'Correlation Matrix Heatmap')

5 Classification Tree

In this section we perform a classification tree analysis on the dataset. We first load the necessary packages and the dataset, then prepare the data by encoding the categorical variables and splitting it into training and testing sets. Finally, we prune the tree with different max_depth values to find the depth that balances training and test accuracy.

We first loaded the dataset and identified make as the target variable. We encoded the categorical variables using label encoding to convert them into numerical values.

We then split the dataset into training (80%) and testing (20%) sets to be able to evaluate the model's performance on unseen data after training, and to check whether the model is overfitting. We will see that it does.

We first trained a decision tree classifier on the training data without any constraints; the “None” row below represents this unpruned case. As we can see, we observed overfitting, with high accuracy on the training data and noticeably lower accuracy on the test data. We therefore decided to prune the tree, which has the advantage of simplifying the model and thus limiting overfitting. We pruned by trying several values of the max_depth parameter to control the tree's growth (None, 5, 10, 15, 20, 25, 30), looking for the tree depth that best balances training and test accuracy.

max_depth   Training Accuracy   Test Accuracy
5       0.2605      0.2550
10      0.4887      0.4674
15      0.7203      0.6352
20      0.8520      0.6926
25      0.8898      0.6972
30      0.8939      0.6971
None        0.8939      0.6979

The model's accuracy improved as the tree's depth increased up to a point, with a max_depth of 25 or 30 providing the best test accuracy, close to 70%. Reducing the max_depth to 10 or 15 improves the balance between training and test accuracy, and therefore drastically reduces the overfitting, but at the expense of the accuracy of the model on new data. Pruning with a max_depth of 25 keeps the test accuracy essentially unchanged (69.72% versus 69.79% for the unpruned tree) while slightly reducing the gap between the training and test sets. In our case, pruning the decision tree helps improve its generalization performance by preventing it from becoming too complex and overfitting the training data.
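The depth sweep above can be sketched with scikit-learn as follows, here on synthetic data rather than the car dataset (so the accuracies differ from the table):

```python
# Sweep max_depth and compare train/test accuracy on synthetic data
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=10, n_informative=6,
                           n_classes=4, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

scores = {}
for depth in [5, 10, None]:  # None = unpruned tree
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42).fit(X_tr, y_tr)
    scores[depth] = (tree.score(X_tr, y_tr), tree.score(X_te, y_te))
    print(depth, scores[depth])
```

As in the table above, the deeper the tree, the better it fits the training data, while the test accuracy plateaus or drops once the tree starts to overfit.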

6 Neural Network

In this section, we will build a neural network model to predict the make of a car based on the features at our disposal. We will preprocess the data, split it into training and testing sets, define the neural network architecture, compile the model, train it and evaluate its performance.

6.1 Preprocessing and splitting the data

The dataset contains different types of data. Some columns are numerical (like “city_mpg_fuel_type_1” or “charge_time_240v”) and some are categorical (“vehicle_class” or “fuel_type”). We identify and separate these two types of columns. Separating numerical and categorical columns is an essential preprocessing step because they require different handling. The numerical columns need to be scaled so that they have a mean of zero and a standard deviation of one, which helps the learning algorithm perform better, while the categorical columns need to be one-hot encoded, which creates one binary column per level, a format that the machine learning model can understand.
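A minimal scikit-learn sketch of this preprocessing, assuming one numerical and one categorical column with hypothetical names:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny illustrative frame: one numerical and one categorical column
df = pd.DataFrame({
    "city_mpg_fuel_type_1": [15.0, 20.0, 25.0],
    "vehicle_class": ["Minivan", "Truck", "Minivan"],
})
pre = ColumnTransformer([
    ("num", StandardScaler(), ["city_mpg_fuel_type_1"]),  # mean 0, sd 1
    ("cat", OneHotEncoder(), ["vehicle_class"]),          # one binary column per level
])
X = pre.fit_transform(df)
print(X.shape)  # (3, 3): one scaled column plus two indicator columns
```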

The data is split into two parts: training and testing. The training set is used to train the model, and the testing set is used to evaluate its performance. This split ensures that we can test how well the model generalizes to new, unseen data.

6.1.1 Building the neural network models and training them

6.1.1.1 Base Neural Network

We chose to use a neural network consisting of layers of neurons, where each layer applies a transformation to the data. The first layer, the input layer, takes the input features. The hidden layers then help the model learn complex patterns, and the output layer predicts the probability of each car manufacturer. The first hidden layer has 128 neurons, the second 64 neurons, and the output layer has as many neurons as there are car manufacturers. The activation function used in the hidden layers is the Rectified Linear Unit (ReLU), and the output layer uses the Softmax activation function. The model is compiled with the Adam optimizer and the categorical cross-entropy loss function.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Dense

# Define the neural network model
model_no_dropout = Sequential([
    Input(shape=(X_train.shape[1],)),
    Dense(128, activation='relu'),
    Dense(64, activation='relu'),
    Dense(y_train.shape[1], activation='softmax')])

We used activation functions in the hidden layers to introduce non-linearity into the model. The ReLU activation function is used in the hidden layers because it is computationally efficient and helps the model learn complex patterns in the data. The Softmax activation function is used in the output layer because it converts the model’s raw output into probabilities that sum to one. This allows us to interpret the model’s output as the probability of each car manufacturer.
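To make the Softmax behaviour concrete, here is a small NumPy sketch showing how raw network outputs are turned into probabilities that sum to one (the logits are made up):

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax: shift by the max before exponentiating."""
    e = np.exp(z - z.max())
    return e / e.sum()

# Raw outputs (logits) for, say, four candidate makes
logits = np.array([2.0, 1.0, 0.1, -1.0])
probs = softmax(logits)
print(probs.round(3), float(probs.sum()))  # the probabilities sum to 1
```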

We used the following hyperparameters (non-exhaustive list):

  • epochs: 5 (Corresponds to the number of times the model sees the entire dataset during training.)
  • batch_size: 32 (Corresponds to the number of samples that the model processes before updating the weights.)
  • validation_split: 0.2 (Corresponds to the fraction of the training data to be used as validation data.)

The model is trained for 5 epochs with a batch size of 32. The validation split is set to 0.2, which means that 20% of the training data is used for validation.

Overall, the model performs well, but we haven't dealt with the issue of unbalanced classes yet. Let's have a look at the distribution of the sensitivity and specificity for each class.

6.1.1.1.1 Issue of unbalanced classes

The issue of unbalanced classes, as explained previously, can strongly weaken the model's ability to generalize to new data: the model will automatically prefer to predict the most frequent classes. We can see in the boxplots below the distribution of the sensitivity and specificity across the classes. Even though we already dealt in part with the unbalanced classes during the cleaning process, as seen in the graph (((((((((((((((add graph reference))))))))))))))), there are still big differences between the classes.

[Boxplot: distribution of sensitivity and specificity across the classes]

By examining the boxplot representing the distribution of sensitivity and specificity across the classes, we observe clear evidence of class imbalance. Sensitivity and specificity are not consistent across the classes. Specificity, which measures how well the model identifies true negatives, is very high for every class. This indicates that the model is effective at detecting instances that do not belong to a given class. However, sensitivity, which measures the true positive rate and reflects how well the model correctly predicts the make of a car for a specific brand, is not as high. This suggests that the model is not performing well for all classes.
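Both metrics can be computed per class directly from the confusion matrix in a one-vs-rest fashion. A minimal sketch (the 3-class confusion matrix is invented for illustration; our actual model has one row and column per car make):

```python
import numpy as np

def per_class_sensitivity_specificity(cm):
    """One-vs-rest sensitivity and specificity per class from a square
    confusion matrix (rows = true class, columns = predicted class)."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)                  # correctly predicted per class
    fn = cm.sum(axis=1) - tp          # missed positives per class
    fp = cm.sum(axis=0) - tp          # false alarms per class
    tn = cm.sum() - tp - fn - fp      # everything else
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity

# Toy example: class 2 is rare and often misclassified
cm = [[50, 3, 2],
      [4, 40, 1],
      [6, 5, 4]]
sens, spec = per_class_sensitivity_specificity(cm)
# sens[2] is low (4/15) while spec[2] stays high (97/100)
```

The toy numbers mirror the pattern in our boxplots: the rare class keeps a high specificity but a poor sensitivity.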

For some classes with more vehicle models, the model tends to predict those classes more frequently, leading to higher accuracy but lower sensitivity for rarer classes. To address this issue, we will use class weights to ensure the model performs more evenly across all classes.

6.1.1.1.2 Adding class weights to the model

This technique has been explained in the methods section but in a nutshell, it is a way to penalize the model for misclassifying the minority class more than the majority class. This helps the model to learn the patterns in the minority class better and improve its performance on the test set.
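As a sketch of how such weights can be derived, the snippet below applies the "balanced" heuristic (the same one scikit-learn's `compute_class_weight` uses); the toy labels are invented for illustration:

```python
import numpy as np

def balanced_class_weights(y):
    """'Balanced' class weights: n_samples / (n_classes * class_count).
    Rare classes get weights above 1, frequent classes below 1."""
    classes, counts = np.unique(y, return_counts=True)
    weights = len(y) / (len(classes) * counts)
    return dict(zip(classes.tolist(), weights.tolist()))

# Toy labels: class 0 is four times more frequent than class 1
y = np.array([0] * 80 + [1] * 20)
weights = balanced_class_weights(y)
# weights ≈ {0: 0.625, 1: 2.5}
```

The resulting dictionary can then be passed to Keras via `model.fit(..., class_weight=weights)`, so that each misclassified minority-class sample contributes more to the loss.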

[Figure: boxplots of per-class sensitivity and specificity with class weights]

As we can see, the model takes better care of the minority classes, and sensitivity is higher overall across the classes. Sensitivity and specificity are also more consistent across the classes, so the model generalizes better to new data. In our case, this method does not completely eliminate the issue of unbalanced classes; given the structure of our data and the discrepancy between our classes, we will keep using this technique for the following neural networks and move on.

6.1.1.1.3 Model performance

We can now look at the evolution of the accuracy of our model during the training process with the following plot.

As we can see, at each epoch, the accuracy is increasing and the loss is decreasing. The model is learning from the training data and improving its predictions.

But, in the end, we have a case of overfitting. The model performs well on the training data but not as well on the testing data. This is an issue because it limits the possibility of generalizing the model to new data.

Performance of the model with weighted classes:

Final Training Accuracy: 77.42%
Final Validation Accuracy: 72.01%
Test Set Accuracy: 71.14%

Overall, the performance of the model is still good, but it can be improved. To address the issue of overfitting, we will introduce Dropout layers in the neural network.

6.1.1.2 Neural Network with Dropout layers

Dropout layers randomly set a fraction of input units to zero during training, which helps prevent overfitting by forcing the model to learn more robust features. We will tune the dropout rate to find the value that best balances training and validation accuracy while reducing overfitting.
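To illustrate the mechanism, here is a NumPy sketch of "inverted" dropout, the variant Keras's Dropout layer implements (not our model code, just the idea in isolation):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(x, rate, training=True):
    """Inverted dropout: zero a fraction `rate` of units and rescale the
    survivors by 1/(1-rate) so the expected activation is unchanged.
    At inference time (training=False) the layer is a no-op."""
    if not training or rate == 0.0:
        return x
    keep = rng.random(x.shape) >= rate        # mask of surviving units
    return x * keep / (1.0 - rate)

x = np.ones((4, 1000))
out = dropout_forward(x, rate=0.2)
# Roughly 20% of entries are zeroed; the mean stays near 1.0 in expectation
```

Because each forward pass sees a different random subnetwork, no single neuron can be relied upon exclusively, which is what pushes the model toward more robust features.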

We used the following hyperparameters (non-exhaustive list):

  • epochs: 5 (Corresponds to the number of times the model sees the entire dataset during training.)
  • batch_size: 32 (Corresponds to the number of samples that the model processes before updating the weights.)
  • validation_split: 0.2 (Corresponds to the fraction of the training data to be used as validation data.)
  • dropout_rate: varies (Corresponds to the fraction of neurons to drop during training.)

We will try 5 different dropout rates in addition to the no-dropout case. We will train the model with each dropout rate, evaluate its performance on the validation and test sets, and then plot the training, validation, and test accuracies for each dropout rate to find the optimal value.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Dense, Dropout
from tensorflow.keras.optimizers import Adam

# Function to create and compile the model
def create_model(dropout_rate=0.0):
    model = Sequential([
        Input(shape=(X_train.shape[1],)),              # one input per feature
        Dense(128, activation='relu'),
        Dropout(dropout_rate),                         # randomly drop units while training
        Dense(64, activation='relu'),
        Dropout(dropout_rate),
        Dense(y_train.shape[1], activation='softmax')  # one output per car make
    ])
    model.compile(optimizer=Adam(), loss='categorical_crossentropy', metrics=['accuracy'])
    return model

We can see that the model with a dropout rate of 0.2 has the best performance on the test set. This model strikes a good balance between training and validation accuracy, generalizes well to new data, and eliminates the overfitting issue. We will use this dropout rate of 0.2 to train the final model with dropout layers.

We see that the model with dropout layers generalizes better than the one without: the gap between training and validation accuracy has closed, and the model is no longer overfitting.

Performance of the model with weighted classes and dropout layers:

Final Training Accuracy: 62.59%
Final Validation Accuracy: 66.21%
Test Set Accuracy: 65.41%

The final accuracy of our model is not as great as with our first model, but this model is at least better at representing the data and generalizing to new data.